Consul enable tag override master #5774

Conversation

bogdanalbei

At the moment, Consul's EnableTagOverride flag is not set by Nomad, so the default (false) is always used. In certain cases it is useful for services outside Nomad to update the tags of Consul services; for that to happen, EnableTagOverride has to be set to true.
This change enables setting EnableTagOverride at the service level.
An older request describing this can be found in #2057.
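For illustration only (not part of this PR's diff): a minimal sketch, assuming the standard Consul Go API (`github.com/hashicorp/consul/api`), of what setting `EnableTagOverride` on a service registration looks like. This is the flag the change plumbs through from Nomad's service definition.

```go
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Connect to the local Consul agent.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	reg := &api.AgentServiceRegistration{
		Name: "web",
		Port: 8080,
		Tags: []string{"v1"},
		// With EnableTagOverride set, tooling outside Nomad may update this
		// service's tags in the Consul catalog without anti-entropy
		// reverting them to the registered value.
		EnableTagOverride: true,
	}

	if err := client.Agent().ServiceRegister(reg); err != nil {
		log.Fatal(err)
	}
}
```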

Preetha Appan and others added 30 commits April 22, 2019 15:29
In Nomad 0.9, we made volume driver handling the same for `""` and
`"local"` volumes. Prior to Nomad 0.9, however, these had slightly
different behaviour for relative paths and named volumes.

Prior to 0.9, the empty-string driver would expand relative paths within
the task dir, and `"local"` volumes that are not absolute paths would be
treated as Docker named volumes.

This commit reverts to the previous behaviour as follows:

| Nomad Version | Driver  |   Volume Spec    | Behaviour                 |
|---------------|---------|------------------|---------------------------|
| all           | ""      | testing:/testing | allocdir/testing          |
| 0.8.7         | "local" | testing:/testing | "testing" as named volume |
| 0.9.0         | "local" | testing:/testing | allocdir/testing          |
| 0.9.1         | "local" | testing:/testing | "testing" as named volume |
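A minimal sketch of the distinction in the table above; `expandVolume` and its inputs are hypothetical names, not Nomad's actual implementation:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// expandVolume maps a "source:destination" volume spec to a bind source,
// mirroring the pre-0.9 behaviour: the "" driver expands relative sources
// inside the task's alloc dir, while the "local" driver leaves relative
// sources alone so Docker treats them as named volumes.
func expandVolume(driver, spec, taskDir string) string {
	parts := strings.SplitN(spec, ":", 2)
	src := parts[0]
	if filepath.IsAbs(src) {
		return src // absolute paths are used as-is by both drivers
	}
	if driver == "" {
		return filepath.Join(taskDir, src) // e.g. <allocdir>/testing
	}
	return src // "local": relative source becomes a Docker named volume
}

func main() {
	fmt.Println(expandVolume("", "testing:/testing", "/alloc/task"))      // /alloc/task/testing
	fmt.Println(expandVolume("local", "testing:/testing", "/alloc/task")) // testing (named volume)
}
```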
Here we retain the 0.8.7 behavior of waiting for driver fingerprints
before registering a node, with some timeout.  This is needed for system
jobs, as system job scheduling for a node occurs at node registration,
and the race might mean that a system job does not get placed on the
node because of missing drivers.

The timeout isn't strictly necessary, but we raise it to 1 minute as
that is closer to blocking indefinitely than 1 second is.  We need to
keep the value high enough to capture as many drivers/devices as
possible, but low enough that it doesn't risk blocking too long due to a
misbehaving plugin.

Fixes hashicorp#5579
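A minimal sketch of the wait-with-timeout idea described above; the channel, helper name, and durations are assumptions, not Nomad's actual code:

```go
package main

import (
	"log"
	"time"
)

// waitForFingerprints blocks until fingerprinting signals completion on
// ready, or until timeout elapses, so node registration is not delayed
// indefinitely by a misbehaving plugin.
func waitForFingerprints(ready <-chan struct{}, timeout time.Duration) {
	select {
	case <-ready:
		log.Println("driver fingerprinting complete; registering node")
	case <-time.After(timeout):
		log.Println("timed out waiting for driver fingerprints; registering node anyway")
	}
}

func main() {
	ready := make(chan struct{})
	go func() {
		time.Sleep(200 * time.Millisecond) // simulate fingerprinting work
		close(ready)
	}()
	waitForFingerprints(ready, time.Minute) // 1 minute cap, as described above
}
```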
I noticed that `watchNodeUpdates()` calls `retryRegisterNode()` almost
immediately after `registerAndHeartbeat()`, well after 5 seconds.

This call is unnecessary and made debugging a bit harder.  So here, we
ensure that we only re-register the node for new node events, not for
the initial registration.
Noticed that the `detected drivers` log line was misleading: when a
driver doesn't fingerprint before the timeout, its health status is the
empty string `""`, which we would mark as detected.

Now, we log all drivers along with their state to ease driver
fingerprint debugging.
Currently, when logmon fails to reattach, we will retry reattachment to
the same pid until the task restart specification is exhausted.

Because we cannot clear hook state during error conditions, it is not
possible for us to signal to a future restart that it _shouldn't_
attempt to reattach to the plugin.

Here we revert to explicitly detecting reattachment separately from
launching a new logmon, so we can recover from scenarios where a logmon
plugin has failed.

This is a net improvement over the current hard failure situation, as it
means in the most common case (the pid has gone away), we can recover.

Other reattachment failure modes where the plugin may still be running
could potentially cause a duplicate process, or a subsequent failure to launch
a new plugin.

If there was a duplicate process, it could potentially cause duplicate
logging. This is better than a production workload outage.

If there was a subsequent failure to launch a new plugin, it would fail
in the same way (retry until restarts are exhausted) as the current
failure mode.
Co-Authored-By: notnoop <mahmood@notnoop.com>
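A minimal sketch of the recovery pattern described in the commit message above; `reattachLogmon` and `launchLogmon` are hypothetical stand-ins for the hook's real reattach/launch logic:

```go
package main

import (
	"errors"
	"log"
)

var errPluginGone = errors.New("logmon process is gone")

// reattachLogmon pretends to reconnect to a previously started logmon; here
// it always fails, to simulate the common case where the pid has gone away.
func reattachLogmon(pid int) error { return errPluginGone }

// launchLogmon pretends to start a fresh logmon plugin.
func launchLogmon() error { return nil }

// startOrRecoverLogmon tries to reattach first and, instead of failing the
// task, falls back to launching a new plugin when reattachment fails.
func startOrRecoverLogmon(prevPid int) error {
	if prevPid != 0 {
		if err := reattachLogmon(prevPid); err == nil {
			return nil
		}
		log.Printf("failed to reattach to logmon (pid %d); launching a new one", prevPid)
	}
	return launchLogmon()
}

func main() {
	if err := startOrRecoverLogmon(4242); err != nil {
		log.Fatal(err)
	}
}
```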
Fixes hashicorp#5593

The executor seems to die unexpectedly after the nomad agent dies or is
restarted.  The crash seems to occur at the first log message after
the nomad agent dies.

To ease debugging we forward executor log messages to executor.log as
well as to Stderr.  `go-plugin` sets up plugins with Stderr pointing to
a pipe read by the plugin client, the nomad agent in our case [1].
When the nomad agent dies, the pipe is closed, and any subsequent
executor log write fails with ErrClosedPipe and a SIGPIPE signal.
SIGPIPE results in the executor process dying.

I considered adding a handler to ignore SIGPIPE, but the go-hclog
library currently panics when a logging write operation fails [2].

Thus we opt to revert to the v0.8 behavior of exclusively writing logs
to executor.log, while we investigate alternative options.

[1] https://github.com/hashicorp/nomad/blob/v0.9.0/vendor/github.com/hashicorp/go-plugin/client.go#L528-L535
[2] https://github.com/hashicorp/nomad/blob/v0.9.0/vendor/github.com/hashicorp/go-hclog/int.go#L320-L323
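A minimal sketch of the v0.8-style behavior described above, assuming `go-hclog`; the file path and logger name are illustrative:

```go
package main

import (
	"log"
	"os"

	hclog "github.com/hashicorp/go-hclog"
)

func main() {
	// Open (or create) executor.log; this file stays writable even after
	// the nomad agent that launched the executor exits.
	f, err := os.OpenFile("executor.log", os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0o644)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	logger := hclog.New(&hclog.LoggerOptions{
		Name:   "executor",
		Level:  hclog.Debug,
		Output: f, // write exclusively to the file, not os.Stderr
	})

	logger.Info("executor logging redirected to file")
}
```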
driver/docker: collect tty container logs
…-errs-2

logmon: recover from shutting down call locally
…omad into consul_enable_tag_override_master

# Conflicts:
#	CHANGELOG.md
#	client/client.go
#	command/agent/bindata_assetfs.go
#	drivers/docker/docklog/docker_logger.go
#	drivers/docker/driver.go
#	version/version.go
@hashicorp-cla

hashicorp-cla commented Jun 4, 2019

CLA assistant check

Thank you for your submission! We require that all contributors sign our Contributor License Agreement ("CLA") before we can accept the contribution. Read and sign the agreement

Learn more about why HashiCorp requires a CLA and what the CLA includes


5 out of 6 committers have signed the CLA.

  • preetapan
  • notnoop
  • endocrimes
  • schmichael
  • cgbaker
  • Nomad Release bot

Nomad Release bot seems not to be a GitHub user.
You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.

Have you signed the CLA already but the status is still pending? Recheck it.

@bogdanalbei
Author

Created a different PR with a cleaner commit history: #5775

@bogdanalbei bogdanalbei closed this Jun 4, 2019
@github-actions

github-actions bot commented Feb 9, 2023

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 9, 2023